Abstract

Recurrent Neural Networks (RNNs) have become increasingly
popular for the task of language understanding. In this task,
a semantic tagger is deployed to associate a semantic label with
each word in an input sequence. The success of RNNs may be
attributed to their ability to memorize long-term dependencies that
relate the current semantic label prediction to observations
many time steps away. However, the memory capacity of simple
RNNs is limited because of the vanishing and exploding gradient
problem. We propose to use an external memory
to improve the memorization capability of RNNs. We conducted
experiments on the ATIS dataset and observed that the proposed
model achieved state-of-the-art results. We
compare our proposed model with alternative models and report
analysis results that may provide insights for future research.
Index Terms: Recurrent Neural Network, Language Understanding,
Long Short-Term Memory, Neural Turing Machine

1. Introduction 

Neural network based methods have recently demonstrated
promising results on many natural language processing tasks
[1, 2]. Specifically, recurrent neural network (RNN) based methods
have shown strong performance, for example, in language
modeling [3], language understanding [4], and machine translation
tasks.

The main task of a language understanding (LU) system is
to associate words with semantic meanings [7-9]. For example,
in the sentence "Please book me a ticket from Hong Kong to
Seattle", an LU system should tag "Hong Kong" as the departure
city of a trip and "Seattle" as its arrival city. The widely used
approaches include conditional random fields (CRFs) [8, 10],
support vector machines [11] and, more recently, RNNs [4, 12].

An RNN consists of an input layer, a recurrent hidden layer, and
an output layer. The input layer reads each word and the output
layer produces probabilities of semantic labels. The success of
RNNs can be attributed to the fact that RNNs, if successfully
trained, can relate the current prediction to input words that
are several time steps away. However, RNNs are difficult to
train because of the vanishing and exploding gradient problem
[13]. The problem also limits RNNs' memory capacity
because error signals may not be back-propagated far
enough.

There have been two lines of research to address this
problem. One is to design learning algorithms that avoid
gradient exploding, e.g., using gradient clipping [14], and/or
gradient vanishing, e.g., using second-order optimization methods
[15]. Alternatively, researchers have proposed more advanced
model architectures, in contrast to the simple RNN that
uses, e.g., the Elman architecture [16]. Specifically, long short-term
memory (LSTM) [17, 18] neural networks have three gates
that control the flow of error signals. The recently proposed gated
recurrent neural network (GRNN) [6] may be considered a
simplified LSTM with fewer gates.

Along this line of research on developing more advanced
architectures, this paper focuses on a novel neural network
architecture. Inspired by the recent work in [19], we extend the
simple RNN with Elman architecture to use an external memory.
The external memory stores the past hidden layer activities,
not only from the current sentence but also from past sentences.
To predict outputs, the model uses the input observation together
with content retrieved from the external memory. The proposed
model performs strongly on a common language understanding
dataset and achieves new state-of-the-art results.

This paper is organized as follows. We briefly describe the
background of this research in Sec. 2. Section 3 presents details
of the proposed model. Experiments are reported in Sec. 4. We
relate our research to other work in Sec. 5. Finally, we present
conclusions and discussions in Sec. 6.

2. Background 

2.1. Language understanding 

A language understanding system predicts an output sequence
of tags, such as named entities, given an input sequence of words.
Often, the output and input sequences have been aligned. In
these alignments, an input may correspond to a null tag or a
single tag. An example is given in Table 1.


Words:  book  a  flight  from  Hong Kong  to  Seattle
Tags:                          Dpt-city   -   Arv-city


Table 1: An example of language understanding. Label names 
have been shortened to fit. Many words are labeled null or 

Given a T-length input word sequence $x_1^T$, a corresponding
output tag sequence $y_1^T$, and an alignment A, the posterior
probability $p(y_1^T \mid A, x_1^T)$ is approximated by

$$p(y_1^T \mid x_1^T) \approx \prod_{t=1}^{T} p(y_t \mid x_{t-k}^{t+k}), \qquad (1)$$

where k is the size of a context window and t indexes the
positions in the alignment.
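As an illustration of the context window in Eq. (1), the following minimal Python sketch builds the window $x_{t-k}^{t+k}$ for every position t. The function name `context_windows` and the padding token `<pad>` used at sentence boundaries are assumptions made for this example; how boundaries are handled is not specified here.

```python
def context_windows(words, k, pad="<pad>"):
    """Return, for each position t, the 2k+1 words centred on words[t].

    Positions that fall outside the sentence are filled with `pad`
    (a hypothetical boundary token chosen for this sketch).
    """
    padded = [pad] * k + list(words) + [pad] * k
    return [padded[t:t + 2 * k + 1] for t in range(len(words))]


sentence = ["from", "Hong", "Kong", "to", "Seattle"]
for window in context_windows(sentence, k=1):
    print(window)
```

With k = 1 each window holds three words: the current word and its two neighbors.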






2.2. Simple recurrent neural networks 

The above posterior probability can be computed using an RNN.
An RNN consists of an input layer $x_t$, a hidden layer $h_t$, and an
output layer $y_t$. In the Elman architecture [16], the hidden layer
activity $h_t$ depends on both the input $x_t$ and, recurrently, on the
past hidden layer activity $h_{t-1}$.

Because of the recurrence, the hidden layer activity $h_t$
depends on the observation sequence from its beginning. The
posterior probability is therefore computed as follows

$$p(y_1^T \mid x_1^T) \approx \prod_{t=1}^{T} p(y_t \mid x_1^t), \qquad (2)$$

where the output $y_t$ and hidden layer activity $h_t$ are computed
as

$$y_t = g(h_t), \qquad (3)$$

$$h_t = \sigma(x_t, h_{t-1}). \qquad (4)$$

In the above equations, $g(\cdot)$ is the softmax function and $\sigma(\cdot)$ is
the sigmoid or tanh function. The above model is denoted as the simple
RNN, to contrast it with the more advanced recurrent neural
networks described below.
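The recurrence in Eqs. (3) and (4) can be sketched in a few lines of plain Python. Eq. (4) does not spell out how $x_t$ and $h_{t-1}$ are combined, so the weight matrices `W_xh`, `W_hh` and `W_ho`, the additive form inside tanh, and the toy sizes below are all assumptions made for illustration; bias terms are omitted.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def softmax(z):
    """Numerically stable softmax, the function g(.) in Eq. (3)."""
    top = max(z)
    exps = [math.exp(z_i - top) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]


def elman_step(x_t, h_prev, W_xh, W_hh, W_ho):
    """One time step of a simple (Elman) RNN, following Eqs. (3)-(4).

    The weight names and the additive form inside tanh are assumptions.
    """
    pre = [a + b for a, b in zip(matvec(W_xh, x_t), matvec(W_hh, h_prev))]
    h_t = [math.tanh(p) for p in pre]      # Eq. (4) with sigma = tanh
    y_t = softmax(matvec(W_ho, h_t))       # Eq. (3) with an output matrix
    return y_t, h_t


# Toy sizes: 2-dimensional input, 2 hidden units, 3 labels.
W_xh = [[0.5, -0.2], [0.1, 0.3]]
W_hh = [[0.0, 0.4], [-0.3, 0.2]]
W_ho = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

h = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0]):
    y, h = elman_step(x, h, W_xh, W_hh, W_ho)
    print(y)
```

Each printed vector is a probability distribution over the three toy labels, and the hidden activity `h` carries information from one step to the next.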

2.3. Recurrent neural networks using gating functions 

The current hidden layer activity $h_t$ of a simple RNN is related
to its past hidden layer activity $h_{t-1}$ via the nonlinear function
in Eq. (4). The nonlinearity can cause errors back-propagated
from $h_t$ to explode or to vanish. This phenomenon prevents the
simple RNN from learning patterns that span long time
dependencies [14].

To tackle this problem, the long short-term memory (LSTM)
neural network was proposed in [17]. It introduces memory cells
that depend linearly on their past values. LSTM
also introduces three gating functions, namely the input gate, forget
gate and output gate. We follow a variant of LSTM in [18].

More recently, a gated recurrent neural network
(GRNN) [6] was proposed. Instead of the three gating
functions in LSTM, it uses two gates.

One is a reset gate $r_t$ that relates a candidate activation to
the past hidden layer activity $h_{t-1}$; i.e.,

$$\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh}(r_t \odot h_{t-1})), \qquad (5)$$

where $\tilde{h}_t$ is the candidate activation, $W_{xh}$ and $W_{hh}$ are the
matrices that relate the current observation $x_t$ and the past hidden
layer activity, and $\odot$ is the element-wise product.

The second gate is an update gate $z_t$ that interpolates the
candidate activation and the past hidden layer activity to update
the current hidden layer activity; i.e.,

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t. \qquad (6)$$

These gates are usually computed as functions of the current
observation $x_t$ and the past hidden layer activity; i.e.,

$$r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1}), \qquad (7)$$

$$z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1}), \qquad (8)$$

where $W_{xr}$ and $W_{hr}$ are the weights applied to the observation and to
the past hidden layer activity for the reset gate. $W_{xz}$ and $W_{hz}$ are
similarly defined for the update gate.
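Eqs. (5) to (8) can be combined into a single update step. The following plain-Python sketch is one way to do so; the function name `grnn_step`, the helper functions and the toy dimensions are assumptions for illustration, $\sigma$ in Eqs. (7) and (8) is taken to be the logistic sigmoid, and bias terms are omitted.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grnn_step(x_t, h_prev, W_xh, W_hh, W_xr, W_hr, W_xz, W_hz):
    """One gated-RNN step following Eqs. (5)-(8); biases omitted."""
    # Eq. (7): reset gate.
    r_t = [sigmoid(a + b)
           for a, b in zip(matvec(W_xr, x_t), matvec(W_hr, h_prev))]
    # Eq. (8): update gate.
    z_t = [sigmoid(a + b)
           for a, b in zip(matvec(W_xz, x_t), matvec(W_hz, h_prev))]
    # Eq. (5): candidate activation from the reset-gated past activity.
    gated = [r * h for r, h in zip(r_t, h_prev)]
    h_cand = [math.tanh(a + b)
              for a, b in zip(matvec(W_xh, x_t), matvec(W_hh, gated))]
    # Eq. (6): interpolate between the past activity and the candidate.
    return [(1.0 - z) * h + z * c for z, h, c in zip(z_t, h_prev, h_cand)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so the step simply halves the previous hidden activity.
x = [1.0, -1.0, 0.5]
h_prev = [0.8, -0.4]
h_t = grnn_step(x, h_prev,
                zeros(2, 3), zeros(2, 2),
                zeros(2, 3), zeros(2, 2),
                zeros(2, 3), zeros(2, 2))
print(h_t)
```

The all-zero example is only a sanity check of the interpolation in Eq. (6); trained weights would make the gates depend on the input and on the past activity.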



Figure 1: The RNN-EM model. The model reads input $x_t$ and
outputs $y_t$. Its hidden layer activity $h_t$ depends on the input and
the model's memory content retrieved in $c_t$. $f_t$ and $u_t$ are the
forget and update gates. $k_t$, $e_t$ and $v_t$ denote the key, erase
and new content vectors, respectively. $M_t$ is the external memory, and
$w_t$ is the weight, which is a function of $k_t$ and $M_t$. Z denotes a
time-delay operator. The diamond symbol ⋄ denotes diagonal matrix
multiplication.


3. The RNN-EM architecture 

We extend the simple RNN in this section to use an external memory.
Figure 1 illustrates the proposed model, which we denote
as RNN-EM. As with the simple RNN, it consists of an input
layer, a hidden layer and an output layer. However, instead
of feeding the past hidden layer activity directly to the hidden
layer as in the simple RNN, one input to the hidden layer is
content read from an external memory. RNN-EM uses a weight
vector to retrieve the content from the external memory for use
at the next time step. Each element of the weight vector is
proportional to the similarity of the current hidden layer activity
with the content of the corresponding memory slot. Therefore, content
that is irrelevant to the current hidden layer activity receives a small
weight. We describe RNN-EM in detail in the following sections.
All of the equations below include bias
terms, which we omit for simplicity. We implemented
RNN-EM using Theano [20, 21].

3.1. Model input and output 

The input to the model is a dense vector $x_t \in \mathbb{R}^{d \times 1}$. In the
context of language understanding, $x_t$ is a projection of input
words, also known as word embedding.

The hidden layer reads both the input $x_t$ and a content vector $c_t$
from the memory. The hidden layer activity is computed
as follows

$$h_t = \sigma(W_{ih} x_t + W_c c_t), \qquad (9)$$

where $\sigma(\cdot)$ is the tanh function, $W_{ih} \in \mathbb{R}^{p \times d}$ is the weight
applied to the input vector, $c_t \in \mathbb{R}^{m \times 1}$ is the content from a read
operation described in Eq. (15), and $W_c \in \mathbb{R}^{p \times m}$ is the weight
applied to the content vector.

The hidden layer activity is fed into the output layer as
follows

$$y_t = g(W_{ho} h_t), \qquad (10)$$

where $W_{ho}$ is the weight applied to the hidden layer activity and $g(\cdot)$
is the softmax function.















































Notice that when $c_t = h_{t-1}$, the above model reduces to the simple
RNN.
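The reduction noted above can be written out by substitution. This is only a restatement of Eq. (9), under the added assumption that the content vector has the same dimension as the hidden layer (m = p):

```latex
\begin{align*}
h_t &= \sigma(W_{ih} x_t + W_c c_t) && \text{Eq. (9)} \\
    &= \sigma(W_{ih} x_t + W_c h_{t-1}) && \text{with } c_t = h_{t-1},\ m = p .
\end{align*}
```

The second line is an Elman recurrence of the form of Eq. (4), with $W_c$ playing the role of the hidden-to-hidden weight matrix.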

3.2. External memory read 

RNN-EM has an external memory $M_t \in \mathbb{R}^{m \times n}$. It can be
considered as a memory with n slots, where each slot is a vector
with m elements. Similar to the external memory in computers,
the memory capacity of RNN-EM may be increased by using a
larger n.

The model generates a key vector $k_t$ to search for content
in the external memory. Though there are many possible ways
to generate the key vector, we choose a simple linear function
of the hidden layer activity $h_t$ as follows


Method            F1 score
CRF [26]          92.94
simple RNN [4]    94.11
CNN [27]          94.35
LSTM [28]         94.85
GRNN              94.82
RNN-EM            95.25

Table 2: F1 scores (in %) on ATIS.


RNN-EM has an update gate $u_t$. It simply uses the weight
$w_t$ as follows

$$u_t = w_t. \qquad (18)$$


$$k_t = W_k h_t, \qquad (11)$$


where $W_k \in \mathbb{R}^{m \times p}$ is a linear transformation matrix. Our
intuition is that the memory should be in the same space as, or affine
to, the hidden layer activity.

We use the cosine distance $K(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$ to compare this
key vector with the contents of the external memory. The weight
for the c-th slot $M_t(:, c)$ in memory $M_t$ is computed as follows

$$\hat{w}_t(c) = \frac{\exp\big(\beta_t K(k_t, M_t(:, c))\big)}{\sum_q \exp\big(\beta_t K(k_t, M_t(:, q))\big)}, \qquad (12)$$


Therefore, memory is only updated if it is to be read.

With the two gates described above, the memory is updated
as follows

$$M_t = \mathrm{diag}(f_t) M_{t-1} + \mathrm{diag}(u_t) v_t, \qquad (19)$$

where $\mathrm{diag}(\cdot)$ transforms a vector into a diagonal matrix whose
diagonal elements are taken from the vector.

Notice that when the number of memory slots is small, RNN-EM
may perform similarly to a gated RNN. Specifically,
when n = 1, Eqs. (19) and (6) are qualitatively similar.
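The similarity for n = 1 can be seen by writing Eq. (19) for a single slot. The lines below are a sketch under the assumption that, with one slot, $w_t$, $e_t$, $f_t$ and $u_t$ are scalars and the memory $M_t$ is a single m-dimensional vector; this is a reading of the equations rather than a derivation taken from the experiments.

```latex
\begin{align*}
M_t &= f_t \, M_{t-1} + u_t \, v_t && \text{Eq. (19) with } n = 1 \\
    &= (1 - w_t e_t) \, M_{t-1} + w_t \, v_t && \text{Eqs. (17) and (18)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{Eq. (6)}
\end{align*}
```

In both cases the previous state is decayed by a gate and a newly computed candidate ($v_t$ or $\tilde{h}_t$) is added. The forms differ in that the RNN-EM gate quantities are scalars when n = 1, whereas $z_t$ is a vector.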


where the above weight is normalized and sums to 1.0. $\beta_t$ is a
scalar larger than 0.0. It sharpens the weight vector when $\beta_t$ is
larger than 1.0. Conversely, it smooths or dampens the weight
vector when $\beta_t$ is between 0.0 and 1.0. We use the following
function to obtain $\beta_t$; i.e.,

$$\beta_t = \log(1 + \exp(W_\beta h_t)), \qquad (13)$$

where $W_\beta \in \mathbb{R}^{1 \times p}$ maps the hidden layer activity $h_t$ to a scalar.

Importantly, we also use a scalar coefficient $g_t$ to interpolate
the above weight estimate with the past weight as follows:

$$w_t = (1 - g_t) w_{t-1} + g_t \hat{w}_t. \qquad (14)$$

This function is similar to Eq. (6) in the gated RNN, except
that we use a scalar $g_t$ to interpolate the weight updates, whereas the
gated RNN uses a vector to update its hidden layer activity.

The memory content is retrieved from the external memory
at time $t - 1$ using

$$c_t = M_{t-1} w_{t-1}. \qquad (15)$$
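The read path of Eqs. (12) to (15) can be sketched in plain Python as follows. The function names, the storage of the memory as a list of n slots, the zero-norm guard in `cosine`, and the toy values are assumptions made for illustration. How $g_t$ is produced is not stated above, so it is passed in as a plain number.

```python
import math


def cosine(u, v):
    """Cosine similarity K(u, v) used to compare the key with a slot."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0   # guard is an assumption


def softplus(a):
    """Eq. (13): maps a real number to a positive sharpening factor."""
    return math.log(1.0 + math.exp(a))


def address(k_t, M, beta_t):
    """Eq. (12): softmax over slots of beta_t * K(k_t, M[:, c]).

    M is stored as a list of n slots, each a list of m numbers
    (the columns of the m x n matrix in the text).
    """
    scores = [beta_t * cosine(k_t, slot) for slot in M]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate(w_prev, w_hat, g_t):
    """Eq. (14): blend the new weight estimate with the past weight."""
    return [(1.0 - g_t) * p + g_t * n for p, n in zip(w_prev, w_hat)]


def read(M_prev, w_prev):
    """Eq. (15): content vector c_t = M_{t-1} w_{t-1}."""
    m = len(M_prev[0])
    return [sum(w * slot[i] for w, slot in zip(w_prev, M_prev))
            for i in range(m)]


# Toy memory with n = 3 slots of m = 2 elements each.
M = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
k = [2.0, 0.0]                      # key pointing towards slot 0

for beta in (0.5, 1.0, softplus(5.0)):
    print(beta, address(k, M, beta))

w_prev = [1.0 / 3] * 3
w = interpolate(w_prev, address(k, M, 5.0), g_t=0.5)
print(read(M, w))
```

Printing the weights for increasing $\beta_t$ shows the sharpening effect described above: more of the weight concentrates on the slot most similar to the key.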

3.3. External memory update 

RNN-EM generates a new content vector $v_t$ to be added to its
memory; i.e.,

$$v_t = W_v h_t, \qquad (16)$$

where $W_v \in \mathbb{R}^{m \times p}$. We use the above linear function based
on the same intuition as in Sec. 3.2: the new content and the
hidden layer activity are in the same space as, or affine to, each
other.

RNN-EM has a forget gate as follows


4. Experiments 

4.1. Dataset 

In order to compare the proposed model with alternative modeling
techniques, we conducted experiments on a well-studied
language understanding dataset, the Air Travel Information System
(ATIS) [22-24]. The training part of this dataset consists of
4978 sentences and 56590 words. The test part has 893 sentences and
9198 words. The number of semantic labels is 127, including
the common null label. We use lexicon-only features in the
experiments.

4.2. Comparison with the past results 

The input $x_t$ in RNN-EM has a window size of 3, consisting of
the current input word and its two neighboring words. We use
the AdaDelta method to update gradients [25]. The maximum
number of training iterations was 50. The tuned hyperparameters
included the hidden layer size p, the number of memory
slots n, and the dimension of each memory slot m. The best
performing RNN-EM had a 100-dimensional hidden layer and 8
memory slots, each of 40 dimensions.

Table 2 lists the F1 scores of RNN-EM, together
with the previous best results of alternative models in the
literature. Since there are no previous results for GRNN, we
use our own implementation of it for this study. These results
are the best obtained with their respective systems. The previous best
result was achieved using LSTM. A change of 0.38% in F1 score from
the LSTM result is significant at the 90% confidence level. RNN-EM
improves on LSTM by 0.40% (95.25% versus 94.85%), so the results
in Table 2 show that RNN-EM is significantly better than the
previous best result using LSTM.


$$f_t = 1 - w_t \odot e_t, \qquad (17)$$

where $e_t \in \mathbb{R}^{n \times 1}$ is an erase vector, generated as
$e_t = \sigma(W_{he} h_t)$. Notice that the c-th element of the forget gate is
zero only if both the read weight $w_t$ and the erase vector $e_t$ have their
c-th element set to one. Therefore, memory cannot be forgotten
if it is not to be read.
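The effect of the forget and update gates in Eqs. (17) to (19) can be illustrated with a small plain-Python sketch. Eq. (19) is read here slot by slot: slot c is scaled by the c-th element of $f_t$ and receives $v_t$ scaled by the c-th element of $u_t$. This per-slot reading of the diag notation, the function names and the toy values are assumptions made for illustration.

```python
def forget_gate(w_t, e_t):
    """Eq. (17): f_t = 1 - w_t * e_t, element-wise over the n slots."""
    return [1.0 - w * e for w, e in zip(w_t, e_t)]


def update_memory(M_prev, w_t, e_t, v_t):
    """Eq. (19), applied slot by slot.

    M_prev is a list of n slots, each with m elements. Slot c is scaled
    by f_t[c] and receives u_t[c] * v_t, where u_t = w_t (Eq. (18)).
    Applying the gates per slot is an interpretation of diag(.).
    """
    f_t = forget_gate(w_t, e_t)
    u_t = w_t                                   # Eq. (18)
    return [[f * old + u * new for old, new in zip(slot, v_t)]
            for slot, f, u in zip(M_prev, f_t, u_t)]


# Toy memory with n = 2 slots of m = 3 elements each.
M_prev = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
v_t = [0.5, -0.5, 0.0]

# Slot 0 is fully addressed and fully erased; slot 1 is not addressed.
M_t = update_memory(M_prev, w_t=[1.0, 0.0], e_t=[1.0, 1.0], v_t=v_t)
print(M_t)
```

In this toy case the addressed slot is overwritten by the new content, while the slot with zero read weight is left unchanged even though its erase element is one, which matches the remark that memory cannot be forgotten if it is not to be read.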


4.3. Analysis on convergence and averaged performances 

Results in the previous sections were obtained with models of
different sizes. This section further compares neural network
models that have approximately the same number
of parameters, listed in Table 3. We use the AdaDelta [25] gradient
update method for all these models. Figure 2 plots their









Model         Hidden layer dimension    # of parameters
simple RNN    115                       ≈ 7.4 × 10^3
LSTM          50                        ≈ 7.5 × 10^3
GRNN          60                        ≈ 7.4 × 10^3
RNN-EM†       100, 40 × 8               ≈ 7.3 × 10^3

† 100-dimensional hidden layer, 40-dimensional slot with 8 slots.

Table 3: The size of each neural network model.



Figure 2: Convergence of training entropy. The entropy value
has been converted to its logarithm. (Horizontal axis: epoch.)


training set entropy with respect to the iteration number. To better
illustrate convergence, we have converted the entropy values
to their logarithms. The results show that RNN-EM converges
to a lower training entropy than the other models. RNN-EM also
converges faster than the simple RNN and LSTM.

We further repeated the ATIS experiments 10 times with
different random seeds for these neural network models. We
evaluated their performance after convergence. Table 4
lists their averaged F1 scores, together with their maximum and
minimum F1 scores. A change of 0.12% is significant at the
90% confidence level when comparing against the LSTM result.
Results in Table 4 show that RNN-EM, on average, significantly
outperforms LSTM (94.96% versus 94.73%). The best performance by
RNN-EM (95.22%) is also significantly better than that of the best
performing LSTM (94.81%).

4.4. Analysis on memory size 

The size of the external memory $M_t$ is proportional to the number
of memory slots n. We fixed the dimension of memory slots
to 40 and varied the number of slots. Table 5 lists their test set
F1 scores. The best performing RNN-EM was with n = 8. Notice
that RNN-EM with n = 1 performed better than the simple
RNN, whose best F1 score in Table 4 is 94.09%. This can be explained
by the gate functions in Eqs. (17) and (18) in RNN-EM, which
are absent in simple RNNs. RNN-EM with n = 1 also performed
similarly to the gated RNN, whose best F1 score in Table 4 is
94.70%, partly because of these gate functions.

Memory capacity may be measured using training set entropy.
Table 5 shows that training set entropy initially decreases
as n increases from 1 to 8, showing that the memory
capacity of RNN-EM is improved. However, the entropy
increases as n is further increased. This suggests that memory


Method        Max      Min      Averaged
simple RNN    94.09    93.64    93.80
LSTM          94.81    94.62    94.73
GRNN          94.70    94.32    94.61
RNN-EM        95.22    94.71    94.96

Table 4: The maximum, minimum and averaged F1 scores (in
%) by neural network models.


Slot number n     1       2       4       8       16
F1 score          94.67   94.87   94.91   95.22   94.75
Entropy × 10^3    2.23    1.96    1.91    1.90    2.05

Slot number n     32      64      128     256     512
F1 score          94.87   94.77   94.57   94.84   94.53
Entropy × 10^3    2.16    2.30    2.36    3.43    6.10

Table 5: Test set F1 scores (in %) and training set entropy by
RNN-EM with different slot numbers.


capacity of RNN-EM cannot be increased simply by increasing 
the number of slots. 

5. Related works 

RNN-EM follows the same line of research as [19, 29],
which uses external memory to improve the memory capacity of
neural networks. Perhaps the closest work is the Neural Turing
Machine (NTM) in [19], which focuses on tasks
that require simple inference and has proved effective
in copy, repeat and sorting tasks. NTM requires complex
models because of these tasks. The proposed model is considerably
simpler than NTM and can be considered an extension of the
simple RNN. Importantly, we have shown through experiments
on a common language understanding dataset the promising
results of using the external memory architecture.

6. Conclusions and discussions 

In this paper, we have proposed a novel neural network architecture,
RNN-EM, that uses external memory to improve the memory
capacity of simple recurrent neural networks. On a common
language understanding task, RNN-EM achieves new state-of-the-art
results and performs significantly better than the previous
best result using long short-term memory neural networks.
We have conducted experiments to analyze its convergence and
memory capacity. These experiments provide insights for future
research directions, such as mechanisms for accessing memory
contents and methods to increase memory capacity.